Data source: https://www.kaggle.com/jboysen/us-perm-visas
# data processing
import pandas as pd
import numpy as np
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno  # missing-data visualization
from wordcloud import WordCloud
import imageio
# utilities
import collections
import random
import warnings
from io import BytesIO
import requests
# modeling
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
warnings.filterwarnings('ignore')
%matplotlib inline
from IPython.display import Image
Image(filename = "results.png")
As immigrants in the USA, our biggest concern is the status of our visa. Many factors affect the outcome of a decision. The dataset has around 154 features, but many of them have no impact on the decision. I focused on 18 of the 154, as a lot of the others were either not very insightful or had missing data. As a newcomer to data analysis, I am working on building an optimized model and visualizing it so that tech workers (more than 60% of applicants are software engineers) can plan their visa process.

In this process, I visualized the features to look for insightful patterns, but the best results came from an XGBoost model (a machine learning model), which I used to determine which features are relevant, as shown in the first graph of the figure above. The other two graphs show the top two features. If the decision falls between March and September, the chance of acceptance is more than 90%, whereas in the remaining months it is less than 20%. Similarly, if the case was started between November and February, the chance of acceptance is more than 80%, higher than in other months. These individual interpretations, together with the model, can help applicants improve their chances of getting the visa accepted.
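The month-of-decision pattern described above can be checked with a simple group-by. The following is only a minimal sketch on a few made-up rows, not the notebook's actual computation: the helper name `acceptance_rate_by_month`, the toy data, and the rule that any status starting with "Certified" counts as accepted are all assumptions for illustration.

```python
import pandas as pd

# Toy rows standing in for the real data (hypothetical values, not from the dataset).
toy = pd.DataFrame({
    "case_status": ["Certified", "Denied", "Certified",
                    "Certified-Expired", "Denied", "Certified"],
    "decision_date": ["2015-03-10", "2015-12-01", "2015-06-22",
                      "2015-03-15", "2015-03-20", "2015-12-05"],
})

def acceptance_rate_by_month(frame, date_col="decision_date", status_col="case_status"):
    """Return the share of accepted cases for each calendar month present in the data."""
    # Unparseable dates (for example a 'null' placeholder) become NaT and are dropped.
    dates = pd.to_datetime(frame[date_col], errors="coerce")
    # Assumption: any status starting with "Certified" counts as accepted.
    accepted = frame[status_col].astype(str).str.startswith("Certified")
    valid = dates.notna()
    # The mean of a boolean column is the fraction of True values.
    return accepted[valid].groupby(dates[valid].dt.month).mean()

rates = acceptance_rate_by_month(toy)
print(rates)
```

The same helper could be pointed at `case_received_date` by passing `date_col="case_received_date"`, assuming that column is present in the frame.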
fields = ['application_type', 'case_status', 'class_of_admission',
          'country_of_citizenship', 'decision_date', 'employer_state',
          'employer_name', 'job_info_work_city', 'pw_soc_title',
          'us_economic_sector', 'case_received_date']
data = pd.read_csv('us_perm_visas.csv', usecols=fields)
# usecols already restricts the columns; copy them in the order listed in fields
df = data[fields].copy()
df.head()
df.shape
df.columns
df.isnull().sum()
msno.matrix(df)
df.describe()
df = df.drop(['application_type'], axis=1)
# np.NaN was removed in NumPy 2.0; np.nan works across versions
df = df.replace(np.nan, 'null')
df.dtypes
df['employer_state'].value_counts()[:10].plot(kind='barh').invert_yaxis()
import plotly.graph_objects as go
# Take the five most common job titles so that labels and values line up
top_titles = df['pw_soc_title'].value_counts()[:5]
fig = go.Figure(data=[go.Pie(labels=top_titles.index,
                             values=top_titles.values)])
fig.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=20,
                  marker=dict(line=dict(color='#000000', width=2)))
fig.show()